Mixed Models with R

tidyverse version
Author
Affiliation

HKUST, SOSC

Published

October 22, 2022

1 Kinds of Structure

In terms of structure, data might have one or multiple sources of clustering, and that clustering may be hierarchical, such that clusters are nested within other clusters. For example, repeated observations are nested within students, students within schools, and schools within districts.
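As a minimal sketch of what nested clustering looks like in a data frame (all names and values below are made up for illustration, not part of the gpa data), each lower-level unit belongs to exactly one higher-level cluster:

```r
library(dplyr)

# Hypothetical nested data: two observations per student,
# two students per school
gpa_toy <- tibble(
  school   = rep(c("A", "B"), each = 4),
  student  = rep(c("s1", "s2", "s3", "s4"), each = 2),
  occasion = rep(1:2, times = 4),
  gpa      = c(2.5, 2.7, 3.0, 3.1, 2.8, 2.9, 3.4, 3.5)
)

# Nesting check: each student appears under exactly one school
gpa_toy %>%
  distinct(school, student) %>%
  count(student)
```

If any student showed a count greater than one, the clustering would be crossed rather than nested.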

2 Data and Initial Visualization

Code
require("pacman")

p_load(tidyverse, stargazer, lme4, gt, gtExtras, skimr, kableExtra, merTools, tidyr, tidytext)

load("D:\\OneDrive - HKUST Connect\\practice\\mixed models\\gpa.Rdata")

2.1 data summary

Code
gpa

Categorical data descriptive statistics:

Code
gpa %>% 
  dplyr::select(where(is.factor) | where(is.character)) %>% 
  skim() %>% 
  as_tibble() %>% 
  dplyr::select(-factor.top_counts) %>% 
  gt() %>% 
  gt_highlight_rows(rows = seq(1, 6, 2),
                    fill = "lightgrey") %>% 
  gt_theme_nytimes()
skim_type skim_variable n_missing complete_rate factor.ordered factor.n_unique
factor    student               0          1.00          FALSE             200
factor    occas                 0          1.00          FALSE               6
factor    job                   0          1.00          FALSE               3
factor    sex                   0          1.00          FALSE               2
factor    admitted            384          0.68          FALSE               2
factor    semester              0          1.00          FALSE               2

Numeric data descriptive statistics:

Code
gpa %>% 
  skim() %>% 
  yank("numeric") %>% 
  gt() %>% 
  gt_highlight_rows(rows = c(1, 3),
                    fill = "lightgrey") %>% 
  gt_theme_nytimes()
skim_variable n_missing complete_rate   mean        sd  p0 p25 p50 p75 p100  hist
gpa                   0             1 2.8650 0.3930484 1.7 2.6 2.8 3.1    4 ▁▅▇▅▁
highgpa               0             1 2.9875 0.5948854 2.0 2.5 2.9 3.5    4 ▇▇▆▇▆
year                  0             1 2.0000 0.8168370 1.0 1.0 2.0 3.0    3 ▇▁▇▁▇
occasion              0             1 2.5000 1.7085372 0.0 1.0 2.5 4.0    5 ▇▃▃▃▃

2.3 standard regression

Code
gpa %>% 
  lm(gpa ~ occasion, data = .) %>% 
  stargazer(type = "text", single.row = TRUE, style = "ajps")

---------------------------------------------
                               gpa           
---------------------------------------------
occasion                0.106*** (0.006)     
Constant                2.599*** (0.018)     
N                             1200           
R-squared                     0.214          
Adj. R-squared                0.213          
Residual Std. Error     0.349 (df = 1198)    
F Statistic         325.339*** (df = 1; 1198)
---------------------------------------------
***p < .01; **p < .05; *p < .1               

The above tells us that starting out, i.e. when occasion is zero, the average GPA, denoted by the intercept, is 2.6. Additionally, as we move from semester to semester, we can expect GPA to increase by about 0.11 points. However, this model ignores the clustering of repeated observations within students. A side effect of doing so is that our standard errors are biased, and thus claims about statistical significance based on them would be off.
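To make that interpretation concrete, we can plug the two coefficients from the table into the fitted line by hand (a quick arithmetic check, not part of the original analysis):

```r
# Coefficients taken from the regression table above
intercept <- 2.599
slope     <- 0.106

# Expected GPA at the first occasion (0) and the last (5)
gpa_first <- intercept + slope * 0   # 2.599
gpa_last  <- intercept + slope * 5   # 3.129
c(first = gpa_first, last = gpa_last)
```

So over the six observed occasions, the model predicts an average rise of about half a GPA point.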

3 Mixed Model

3.1 variance components

The code for the mixed model looks just like what we used for regression with lm, but with an additional component specifying the grouping, i.e. the student effect. The \((1|student)\) means that we are allowing the intercept, represented by \(1\), to vary by student.

Code
gpa %>% 
  lmer(gpa ~ occasion + (1 | student), data = .) -> gpa_mixed

gpa_mixed %>% 
  stargazer(type = "text", single.row = TRUE, style = "ajps")

-------------------------------
                     gpa       
-------------------------------
occasion       0.106*** (0.004)
Constant       2.599*** (0.022)
N                    1200      
Log Likelihood     -204.446    
AIC                416.893     
BIC                437.253     
-------------------------------
***p < .01; **p < .05; *p < .1 

The confidence intervals are as follows:

Code
gpa_mixed %>% 
  confint()
                 2.5 %    97.5 %
.sig01      0.22517423 0.2824604
.sigma      0.23071113 0.2518510
(Intercept) 2.55665145 2.6417771
occasion    0.09832589 0.1143027
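Here .sig01 is the standard deviation of the student intercepts and .sigma is the residual standard deviation. Their relative sizes give the intraclass correlation (ICC), the share of total variance attributable to students. A rough sketch, using the interval midpoints above as stand-ins for the point estimates (the actual estimates would come from VarCorr(gpa_mixed)):

```r
# Stand-in values: midpoints of the intervals above, NOT the
# fitted point estimates
sd_student  <- (0.2252 + 0.2825) / 2   # ~ .sig01
sd_residual <- (0.2307 + 0.2519) / 2   # ~ .sigma

# ICC = between-student variance / total variance
icc <- sd_student^2 / (sd_student^2 + sd_residual^2)
icc   # roughly 0.52: about half the variance is between students
```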

We did not allow occasion to vary, so it is a constant, i.e. a fixed effect, for all students.
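If we did want the effect of occasion to vary by student, we would add it to the random part of the formula, \((1 + occasion | student)\). A sketch on simulated data, so the chunk is self-contained (the real model would simply swap in the gpa data; all simulation settings below are made up):

```r
library(lme4)
set.seed(42)

# Simulated stand-in for the gpa data: 50 students, 6 occasions each
sim <- expand.grid(student = factor(1:50), occasion = 0:5)
int_dev   <- rnorm(50, sd = 0.25)   # student-specific intercept shifts
slope_dev <- rnorm(50, sd = 0.03)   # student-specific slope shifts
sim$gpa <- 2.6 + (0.1 + slope_dev[sim$student]) * sim$occasion +
  int_dev[sim$student] + rnorm(nrow(sim), sd = 0.2)

# (1 + occasion | student): both the intercept and the occasion
# slope now vary by student
m_slope <- lmer(gpa ~ occasion + (1 + occasion | student), data = sim)
VarCorr(m_slope)   # reports an SD for occasion as well as the intercept
```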

Code
gpa_mixed %>% 
  REsim() %>% 
  plotREsim()

3.2 cluster level covariates

The mixed model written as a multilevel model:

\[\begin{align*} gpa &= b_{int\_student} + b_{occ}occasion + \epsilon \\ b_{int\_student} &= b_{intercept} + student_{effect} \end{align*}\]

If we add a student level covariate, e.g. sex, to the model, we then have the following.

\[\begin{align*} gpa = b_{intercept} + b_{occ}occasion + b_{sex}sex + (student_{effect} + \epsilon) \end{align*}\]

So, adding cluster level covariates does not have any unusual effect on how we think about the model. Note also that we can create cluster level covariates as group means or some other summary of the observation level variables. This is especially common when the clusters represent geographical units and the observations are people. For example, we might have income as a person level covariate, and use the regional median to represent the overall wealth of each geographical region.
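A sketch of that last point, with made-up numbers (region and income are hypothetical variables, not part of the gpa data):

```r
library(dplyr)

# Person-level data clustered by region
people <- tibble(
  region = rep(c("north", "south"), each = 3),
  income = c(30, 45, 60, 20, 25, 90)
)

# Cluster-level covariate: the median income of each region,
# attached back to every person in that region
people <- people %>%
  group_by(region) %>%
  mutate(region_median_income = median(income)) %>%
  ungroup()
people
```

The resulting region_median_income is constant within each region, which is exactly what makes it a cluster level covariate.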

4 Exercise

Code
data("sleepstudy") 

glimpse(sleepstudy)
Rows: 180
Columns: 3
$ Reaction <dbl> 249.5600, 258.7047, 250.8006, 321.4398, 356.8519, 414.6901, 3…
$ Days     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0…
$ Subject  <fct> 308, 308, 308, 308, 308, 308, 308, 308, 308, 308, 309, 309, 3…

Run a regression with Reaction as the target variable and Days as the predictor.

Code
sleepstudy %>% 
  lm(Reaction ~ Days, data = .) %>% 
  stargazer(type = "text", no.space = T, align = T, style = "ajps") 

-------------------------------------------
                           Reaction        
-------------------------------------------
Days                       10.467***       
                            (1.238)        
Constant                  251.405***       
                            (6.610)        
N                             180          
R-squared                    0.286         
Adj. R-squared               0.282         
Residual Std. Error    47.715 (df = 178)   
F Statistic         71.464*** (df = 1; 178)
-------------------------------------------
***p < .01; **p < .05; *p < .1             

Run a mixed model with a random intercept for subject.

Code
sleepstudy %>% 
  lmer(Reaction ~ Days + (1 | Subject), data = .) %>% 
  broom.mixed::tidy()  # tidy() methods for merMod objects live in broom.mixed